Abstract-Automated cloud detection and tracking is an important 
step in assessing changes in radiation budgets associated with 
global climate change via remote sensing. Data products based on 
satellite imagery are available to the scientific community for 
studying trends in the Earth’s atmosphere. The data products 
include pixel-based cloud masks that assign cloud-cover 
classifications to pixels. Many cloud-mask algorithms have the 
form of decision trees. The decision trees employ sequential tests 
that scientists designed based on empirical astrophysics studies 
and simulations. Limitations of existing cloud masks restrict our 
ability to accurately track changes in cloud patterns over time. In 
a previous study we compared automatically learned decision 
trees to cloud masks included in Advanced Very High Resolution 
Radiometer (AVHRR) data products from the year 2000. In this 
paper we report a replication of that study using five years of data 
and a gold standard based on surface observations performed 
by scientists at weather stations in the British Islands. For our 
sample data, the accuracy of automatically learned decision trees 
was greater than the accuracy of the cloud masks (p < 0.001). 

I. Introduction 

Understanding the role of clouds in the current climate is a 
prerequisite for predicting future climate change due to human 
activities [1]. Satellite-borne instruments continually acquire 
data about the Earth’s oceans, land, and atmosphere. The data 
is processed to derive high-level observations, which are then 
distributed to the scientific community via online data 
products. The data products include cloud masks, which have 
dual functionality. The masks designate locations in which the 
observations may have limited quality due to cloud 
interference, and also provide estimated cloud amounts for 
each location. The cloud mask of interest in this study is 
included in products derived from data acquired by the 
Advanced Very High Resolution Radiometer (AVHRR) 
instrument on board the NOAA-14 weather satellite of the 
National Oceanic and Atmospheric Administration. The mask 
is called Clouds from AVHRR, phase 1 (CLAVR-1) [2]. 

The CLAVR-1 cloud mask is computed from measured 
reflectance and emission values using classification algorithms 
that scientists developed through experimentation with the 
data. To derive the algorithms, the scientists simulated clear- 
sky and cloud characteristics for a variety of surface and 
atmospheric conditions, and analyzed ambiguous 
manifestations of different physical phenomena, for example, 
similar reflectance values for snow, ice and clouds. The 
algorithms employ sequential-threshold tests to arrive at 
decisions about the presence of clouds or about cloud 
composition [2-3]. The limitations of existing cloud masks [4] 
provided motivation for on-going research to develop 
improved cloud detection and characterization algorithms. 

Cloud-detection methods must disambiguate clouds and 
other entities that have characteristics similar to clouds. 
Scientists have used a variety of machine-learning methods to 
learn models for remote sensing data, for example, neural 
networks [5], Bayesian classification [6], kernel methods [7-10], 
genetic algorithms [11], classification trees [12] and 
regression trees [13]. The results of these approaches range 
from promising preliminary results to validated algorithms 
that are deployed in high-level remote-sensing data products 
[14]. Of these machine-learning methods, the methods that 
resemble the sequential tests in cloud masks the most are 
classification trees. Because of this resemblance, we use 
classification trees in this study, and refer to them as 
automatically learned decision trees (ALDT). 

In a previous study [15] we demonstrated the feasibility and 
potential of ALDT for improving the accuracy of cloud masks 
based on AVHRR data. In that study we compared cloud- 
detection results of the CLAVR-1 algorithm, which was 
devised by experts, to cloud-detection results of ALDT. We 
used ground observations collected by the National 
Aeronautics and Space Administration Clouds and the Earth’s 
Radiant Energy System S’COOL project as the gold standard. 
We found that for our sample data, the accuracy of ALDT was 
greater than the accuracy of the CLAVR-1 cloud masks, and 
that the difference in accuracy was statistically significant. 
The goal of this work was to corroborate the preliminary 
results in [15] by replicating the study and enhancing it in 
three ways: extending the time-period coverage of the sample 
data from one year to five years, using a refined ordinal scale 
for cloud quantity, and using a gold-standard generated by 
scientists. 

II. Background 

A. AVHRR Data 

The NOAA-14 AVHRR daily 8km global data product 
includes 12 scientific datasets (SDS), each of which 
incorporates within a single plane a measured parameter, a 
flag, or a computed parameter. The SDS are: normalized 
difference vegetation index, CLAVR-1 cloud mask, quality 
control flag, scan angle, solar zenith angle, relative azimuth 
angle, surface reflectance in the visible wavelengths (channel 
1), surface reflectance in the near-infrared wavelengths 
(channel 2), surface brightness temperature in the thermal 
infrared wavelengths (channels 3-5), and acquisition day and 
time [16]. 

B. The CLAVR-1 Cloud Mask 

The CLAVR-1 algorithm includes four decision trees, one 
for each of daytime land scene, daytime ocean scene, 
nighttime land scene, and nighttime ocean scene. Each 
decision tree performs a series of threshold and uniformity 
tests on a 2x2 array of pixels, and classifies pixels as clear, 
mixed, or cloudy. The values used for each test are either 
retrieved channel values, or functions of retrieved values that 
incorporate acquisition parameters and estimates of emitted 
radiances [2]. Several tests were designed specifically to 
resolve ambiguities, for example, ambiguities due to 
reflectance greater than 44% in channel 1 or channel 2 for 
snow, ice, or sun glint. The thresholds used for the tests were 
derived empirically or via simulations of a variety of 
observation conditions as determined by cloud/surface/time 
combinations. 

The sequential decision process in CLAVR-1 discriminates 
between clouds, first by their gross characteristics, and then by 
their subtle characteristics. The algorithm ensures that pixels 
that fail all the tests have a very small probability of having 
radiatively significant clouds. The sequential-test nature of 
CLAVR-1 makes it similar to ALDT, but unlike the latter, the 
CLAVR-1 algorithm is not based on an exhaustive analysis of 
the data space. 

The CLAVR-1 algorithm has several limitations. First, the 
algorithm assumes that there is a representative sample of 
clear pixels in each image; however, this assumption does not 
hold for broadly overcast scenes. Second, the algorithm does 
not work well for polar-winter scenes or nighttime scenes, 
when only the thermal channels are available. Third, the 
ability of the algorithm to differentiate between clouds and 
other entities that appear as clouds in AVHRR images is 
limited. 

C. CLAVR-1 Evaluation 

Evaluation of cloud masks is difficult because there is no 
gold standard to which to compare the masks. Researchers 
estimate the quality of cloud masks by comparing them to 
masks produced by human analysts or by other algorithms. 
Stowe and colleagues [2] compared the results of CLAVR-1 to 
estimates of a human-expert analyst. Stowe’s team found that 
the mismatch between CLAVR-1 and the expert estimates was 
at least 10%, and that the mismatch varied for different cloud 
amounts, geographical location, and season. 


D. Decision Trees 

Decision trees are classifiers that employ rules sequentially 
to determine the class to which an item belongs. Decision 
trees can be learned automatically from training data for which 
the classes are known using a computer program that 
generates trees via sequential binary partitioning of the 
training data [17]. The learning procedure searches in the 
space of all possible decision trees that fit the data for an 
optimal tree, where the optimization criterion is minimal 
prediction error. 
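
As an illustration of sequential binary partitioning, the following minimal Python sketch learns a small tree from labeled data. It is not the procedure of [17]: it assumes a greedy search that picks one threshold split at a time, it uses the raw misclassification count as the split criterion, and all function names are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Return the most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def errors(labels):
    """Count items misclassified when the majority label is predicted."""
    if not labels:
        return 0
    return len(labels) - Counter(labels).most_common(1)[0][1]


def best_split(rows, labels):
    """Return (error_count, feature_index, threshold) of the best binary
    split, or None when no feature has two distinct values."""
    best = None
    for f in range(len(rows[0])):
        values = sorted({r[f] for r in rows})
        # Candidate thresholds lie midway between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            e = errors(left) + errors(right)
            if best is None or e < best[0]:
                best = (e, f, t)
    return best


def learn_tree(rows, labels, max_depth=3):
    """Greedily partition the data; a leaf is a label, an internal node is
    a tuple (feature_index, threshold, left_subtree, right_subtree)."""
    split = None
    if max_depth > 0 and len(set(labels)) > 1:
        split = best_split(rows, labels)
    if split is None:
        return majority(labels)
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        learn_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        learn_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )


def classify(tree, row):
    """Apply the threshold tests in sequence until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree
```

Calling learn_tree on rows of channel values with their class labels returns a nested tuple that classify can then apply to a new row, one threshold test at a time.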

III. Methods 

A. Data Preparation 

We obtained ground observations of cloud characteristics 
from the British Atmospheric Data Centre (BADC) [18]. The 
BADC data included observations of cloud amounts at 
numerous weather stations within the British Islands. We 
selected all observations that were available for the years 
1996-2000 from 1238 weather stations. Then, we retrieved 8km 
daily AVHRR data that matched the BADC data in acquisition 
date, time, longitude and latitude. We excluded from this 
dataset all records that met at least one of the following criteria: 
(a) the AVHRR data quality flag indicated out-of-range values 
or processing errors; (b) the CLAVR-1 mask had a no-decision 
value; (c) there was no BADC total-cloud-amount observation, 
or the observation value indicated that the estimate was 
affected by obscuring fog or other meteorological phenomena. 
The average number of records per year was 18632. We used 
the BADC observations as the gold standard for labeling 
training and test data. We compared the labels of the test data 
to predictions made for the same data by CLAVR-1 and by the 
ALDT. 
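
The exclusion criteria can be expressed as a simple record filter. The sketch below is illustrative only: the dictionary field names and the "no_decision" marker are hypothetical, and it assumes that any total-cloud-amount value outside 0-8 okta denotes an obscured or otherwise unusable observation.

```python
def keep_record(record: dict) -> bool:
    """Return True if a matched AVHRR/BADC record passes criteria a-c.

    All field names are hypothetical placeholders, not names from the
    AVHRR or BADC products.
    """
    # a. quality flag indicates out-of-range values or processing errors
    if record["avhrr_quality_flag_bad"]:
        return False
    # b. the CLAVR-1 mask carries a no-decision value
    if record["clavr_mask"] == "no_decision":
        return False
    # c. total cloud amount is missing, or (assumption) lies outside
    #    0-8 okta, which we treat as an obscured-sky observation
    okta = record.get("badc_total_cloud_okta")
    if okta is None or not 0 <= okta <= 8:
        return False
    return True


def filter_records(records):
    """Keep only the records that pass every exclusion criterion."""
    return [r for r in records if keep_record(r)]
```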

Although both CLAVR-1 and the BADC data utilized an 
ordinal scale for characterization of cloud amount, the scales 
were different and mapping one scale to the other could be 
done in more than one way. The CLAVR-1 mask had three 
possible values: clear, mixed, and cloudy. The BADC total 
cloud amount was specified in terms of okta, ranging from 0 
(clear sky) to 8 (100% clouds). We mapped the BADC grades 
onto the CLAVR-1 grades in the following way: clear, 0-1 
okta; mixed, 2-5 okta; cloudy, 6-8 okta. 
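
The okta-to-class mapping can be written as a short function. This is a minimal sketch; the function name is hypothetical, and rejecting values outside 0-8 okta is an assumption made for the example.

```python
def okta_to_clavr_class(okta: int) -> str:
    """Map a BADC total cloud amount (0-8 okta) to a CLAVR-1 grade."""
    if not 0 <= okta <= 8:
        # Assumption: out-of-range values are rejected rather than mapped.
        raise ValueError(f"okta must be in 0..8, got {okta}")
    if okta <= 1:
        return "clear"   # 0-1 okta
    if okta <= 5:
        return "mixed"   # 2-5 okta
    return "cloudy"      # 6-8 okta
```

The nine okta levels collapse into three classes, so observations of 2 and 5 okta, for example, receive the same label.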

B. Experiments 

We performed two experiments with the AVHRR data that 
we selected. The experiments differed in the set of variables 
that constituted the input to the decision-tree learning 
procedure. Experiment I included variables that represented 
sensor data: the radiances of channels 1 through 5, and the 
BADC label. Experiment II included the variables of 
Experiment I, as well as three additional function variables 
that are used within the CLAVR-1 daytime-land algorithm [2] 
(see [15] for a more detailed description of the functions we 
used). 



We randomly selected approximately 10% of the data 
points to form a dataset that would be used exclusively as a 
test set for validation. Then, for each of Experiment I and 
Experiment II, we used the remaining data to conduct 100 
bootstrapped [19] training trials to learn and evaluate multiple 
decision trees. For each trial, we randomly partitioned the data 
into a training set and a test set with a size ratio of 9:1. We 
learned a decision tree from the training set with the treefit 
procedure, which is an implementation of classification and 
regression trees [17] available within the MATLAB statistics 
toolbox. We then classified the data in the corresponding test 
set as clear, mixed, or cloudy using the decision tree. We 
compared the classification results to the corresponding 
BADC observations. 
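
The repeated-trial procedure can be sketched as follows. This is not the MATLAB code used in the study: the function names are hypothetical, the learner is passed in as a parameter, and a placeholder majority-class learner stands in for the decision-tree learner so that the example is self-contained.

```python
import random
from collections import Counter


def run_trials(records, learn, n_trials=100, test_fraction=0.1, seed=0):
    """Repeat a random 9:1 train/test partition and return the mismatch
    rate of each trial.

    records: list of (features, label) pairs.
    learn:   function that takes training records and returns a
             predict(features) function.
    """
    rng = random.Random(seed)
    rates = []
    for _ in range(n_trials):
        shuffled = records[:]
        rng.shuffle(shuffled)
        n_test = max(1, round(len(shuffled) * test_fraction))
        test, train = shuffled[:n_test], shuffled[n_test:]
        predict = learn(train)
        mismatches = sum(predict(x) != y for x, y in test)
        rates.append(mismatches / len(test))
    return rates


def learn_majority(train):
    """Placeholder learner (hypothetical): always predicts the most
    common training label. A decision-tree learner would go here."""
    top = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda features: top
```

Because every trial draws from the same pool of records, the resulting training sets overlap from trial to trial, and so do the test sets.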

To measure accuracy for each experiment we computed 
two mismatch rates. First, we computed the rate of mismatch 
between classification results of the ALDT and the BADC 
observations. Second, we computed the rate of mismatch 
between the CLAVR-1 cloud masks and the BADC 
observations. We ran two-sided paired t-tests to determine whether 
there were significant differences between rates of 
classification mismatch for CLAVR-1 and for each of the 
decision trees, and for each pair of decision trees. Finally, we 
used the ALDT to classify the test set we had initially set 
aside, and we compared the rate of classification mismatch to 
that of CLAVR-1. 
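
The two quantities used in this comparison, the mismatch rate and the paired t statistic, can be computed as in the sketch below. The function names are hypothetical. Converting the statistic to a two-sided p-value requires the t distribution, which is left to a statistics package.

```python
import math
import statistics


def mismatch_rate(predicted, observed):
    """Fraction of items whose predicted class differs from the
    gold-standard class."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p != o for p, o in zip(predicted, observed)) / len(observed)


def paired_t_statistic(rates_a, rates_b):
    """Return (t, degrees_of_freedom) for a paired t-test on two
    equal-length samples of per-trial mismatch rates."""
    if len(rates_a) != len(rates_b) or len(rates_a) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    diffs = [a - b for a, b in zip(rates_a, rates_b)]
    n = len(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    if sd == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1
```

With 100 trials per experiment, the test has 99 degrees of freedom.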

IV. Results 

Table I lists the mean and standard deviation of the classification 
mismatch rate for each experiment. Note that the training sets used 
in the bootstrapping trials were not independent of one another, nor 
were the test sets. However, the validation test 
set that was set aside initially was independent of all other 
sets. Columns 3 and 4 in the table show the mean and standard 
deviation of the rates of mismatch between CLAVR-1 and the 
gold standard, and between ALDT and the gold standard. The 
statistics were calculated for the 100 training trials for the five 
years. Columns 5 and 6 show similar statistics calculated for the 
validation test set for the five years. In each of the two 
experiments, the difference in classification-mismatch rates 
between CLAVR-1 and ALDT was significant (p < 0.001). 
Across experiments, the difference in classification-mismatch 
rates between the ALDT of Experiment I and the ALDT of 
Experiment II was not statistically significant. 

V. Discussion 

The two experiments that we performed showed that ALDT 
classified 8km daily AVHRR data for the years 1996-2000 
more accurately than CLAVR-1. The two types of decision 
trees — trees based on only sensor data, and trees based on 
sensor data and functions of the sensor data — had similar 
accuracy. Thus the sensor data alone were sufficient to obtain 
an improvement over CLAVR-1. The sample data we used 
was limited in two ways. First, the sample was influenced by 
the availability of the BADC gold standard. Second, the 
sample was restricted to a geographical area that had a high 
prevalence of clouds. In effect, the ALDT were trained to 
predict BADC observations from AVHRR data. Thus, our 
ability to conclude the true presence or absence of clouds 
based on the results of ALDT depended on the accuracy of the 
BADC observations. 

The mismatch rates with respect to the gold standard that 
we obtained in this study were higher for both CLAVR-1 and 
ALDT compared to the initial feasibility study [15]. The 
higher rates could be explained in part by errors due to scale 
differences. Although scale differences occurred in [15] as 
well, the final scale in [15] had two values only: clear and 
cloudy, and it was CLAVR-1 that was down-scaled, from 
three to two levels. In this study, the CLAVR-1 scale was 
coarser than the BADC scale, and to match the CLAVR-1 
scale we down-scaled BADC data from nine to three levels. 
Here, the mapping between scales involved loss of 
information in the gold-standard. 

We chose to use AVHRR data and the CLAVR-1 mask to 
demonstrate the feasibility and contribution of ALDT because 
of their relative simplicity. The promising results that we 
obtained indicate that it would be a worthy effort to replicate 
the study with data acquired by more advanced instruments 
such as the Moderate-Resolution Imaging Spectroradiometer 
(MODIS) and with the corresponding newer versions of the 
CLAVR mask. Two possible extensions of this work are to 
use ALDT to classify clouds into multiple cloud types, and to 
use regression trees to predict the amount of cloud cover and 
visual opacities. 

TABLE I 

Classification Mismatch Rates with Respect to the Gold Standard 

Experiment  Method                                 Mean -     Standard      Mean -      Standard
                                                   training   deviation -   validation  deviation -
                                                              training                  validation

I           CLAVR-1                                0.523      0.05          0.494       0.02
I           Decision trees based on only           0.287      0.014         0.278       0.02
            channels
II          CLAVR-1                                0.522      0.048         0.497       0.022
II          Decision trees based on channels,      0.286      0.015         0.277       0.013
            acquisition parameters, and functions

VI. Conclusion 

Our work demonstrated that a sequential testing approach 
similar to that used by experts, combined with a 
comprehensive analysis of training data via an automated 
procedure for learning decision trees, contributed to the 
development of an improved cloud mask. 